Correlation of Term Count and Document Frequency for Google N-Grams
Authors

Abstract
For bounded datasets such as the TREC Web Track (WT10g), the computation of term frequency (TF) and inverse document frequency (IDF) is not difficult. However, when the corpus is the entire web, direct IDF calculation is impossible and values must instead be estimated. Most available datasets provide values for term count (TC), meaning the number of times a certain term occurs in the entire corpus. Intuitively this value is different from document frequency (DF), the number of documents (e.g., web pages) a certain term occurs in. We investigate the relationship between TC and DF values of terms occurring in the Web as Corpus (WaC), as well as the similarity between TC values obtained from the WaC and from the Google N-gram dataset. A strong correlation between the two would give us confidence in using the Google N-grams to estimate accurate IDF values, which, for example, is the foundation for generating well-performing lexical signatures based on the TF-IDF scheme. Our results show a very strong correlation between TC and DF within the WaC, with Spearman’s ρ ≥ 0.8 (p ≤ 2.2×10⁻¹⁶), and a high similarity between TC values from the WaC and the Google N-grams.
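The TC/DF distinction the abstract draws can be made concrete with a small sketch. The toy corpus below is invented for illustration; Spearman's ρ is computed as the Pearson correlation of average ranks, as in the abstract's rank-correlation measure.

```python
from collections import Counter

# Toy "documents" standing in for a web corpus (invented for illustration).
corpus = [
    "the cat sat on the mat",
    "the dog saw the cat",
    "a dog and a cat",
]

tc = Counter()  # term count: every occurrence in the corpus counts
df = Counter()  # document frequency: each document counts at most once per term
for doc in corpus:
    tokens = doc.split()
    tc.update(tokens)
    df.update(set(tokens))

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the (tie-averaged) ranks."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank for tied values
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

terms = sorted(tc)
rho = spearman([tc[t] for t in terms], [df[t] for t in terms])
```

Note how "the" illustrates the gap between the two statistics: it occurs four times (TC = 4) but appears in only two documents (DF = 2), which is exactly why a strong TC–DF correlation must be verified rather than assumed.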
Similar resources
Lucene for n-grams using the ClueWeb Collection
The ARSC team made modifications to the Apache Lucene engine to accommodate "go words," taken from the Google Gigaword vocabulary of n-grams. Indexing the Category "B" subset of the ClueWeb collection was accomplished by a divide-and-conquer method, working across the separate ClueWeb subsets for 1-, 2-, and 3-grams. Phrase searching—or imposing an order on query terms—has traditionally been a...
Minimal Perfect Hash Rank: Compact Storage of Large N-gram Language Models
In this paper we propose a new method of compactly storing n-gram language models called Minimal Perfect Hash Rank (MPHR) that uses significantly less space than all known approaches. It requires O(n) construction time and allows for O(1) random access of probability values or frequency counts associated with n-grams. We make use of minimal perfect hashing to store fingerprints of n-grams in an...
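The fingerprint idea behind this kind of compact storage can be illustrated with a toy sketch. This is not a real minimal perfect hash: a plain dict stands in for the in-set slot mapping, unseen keys fall through to an arbitrary slot (as they would under a real MPH), and the key set, counts, and 16-bit fingerprint width are all invented for illustration.

```python
import hashlib

FP_BITS = 16  # fingerprint width (assumption for this sketch)

def _h(key: str) -> int:
    # Deterministic hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

def fingerprint(key: str) -> int:
    return _h(key) % (1 << FP_BITS)

# Toy n-gram frequency counts (invented).
counts = {"new york": 5000, "san francisco": 3200, "the the": 7}

# Slot assignment: a real MPHR builds a minimal perfect hash over the key
# set; here a dict stands in for it, and unseen keys land in some slot.
slot = {k: i for i, k in enumerate(sorted(counts))}

def slot_of(key: str) -> int:
    return slot.get(key, _h(key) % len(slot))

# The table stores only (fingerprint, count); the n-gram strings themselves
# are discarded, which is where the space saving comes from.
table = [None] * len(counts)
for k, c in counts.items():
    table[slot[k]] = (fingerprint(k), c)

def lookup(key: str):
    """O(1) access; an unseen key is rejected with high probability
    because its fingerprint rarely matches the one stored in its slot."""
    fp, c = table[slot_of(key)]
    return c if fp == fingerprint(key) else None
```

The trade-off mirrors the abstract: constant-time random access to counts at the cost of a small, tunable false-positive rate (roughly 2⁻¹⁶ per unseen key here) from fingerprint collisions.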
N-grams based feature selection and text representation for Chinese Text Classification
In this paper, text representation and feature selection strategies for Chinese text classification based on n-grams are discussed. A two-step feature selection strategy is proposed that combines preprocessing within classes with feature selection among classes. Four different feature selection methods and three text representation weights are compared by exhaustive experiments. Both C-SVC...
Unsupervised Approaches to Text Correction Using Google N-grams for English and Romanian
We present an unsupervised approach that can be applied to text correction tasks such as real-word error correction, near-synonym choice, and preposition choice, using n-grams from the Google Web 1T dataset. We present in detail the method for correcting preposition errors, which has two phases. We categorize the n-gram types based on the position of the gap that needs to be replaced with a p...
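The core of such a gap-filling approach can be sketched in a few lines: score each candidate by the frequency of the n-gram it would form around the gap, and keep the most frequent one. The counts and candidate set below are invented stand-ins for the Google Web 1T data.

```python
# Toy trigram counts standing in for Google Web 1T frequencies (invented).
ngram_counts = {
    ("interested", "in", "music"): 900,
    ("interested", "on", "music"): 15,
    ("interested", "at", "music"): 3,
}

PREPOSITIONS = ["in", "on", "at"]  # candidate set (assumption for this sketch)

def best_filler(left: str, right: str, candidates=PREPOSITIONS) -> str:
    """Pick the candidate whose trigram (left, cand, right) is most frequent;
    unseen trigrams score zero."""
    return max(candidates, key=lambda c: ngram_counts.get((left, c, right), 0))
```

For example, `best_filler("interested", "music")` selects "in", since that trigram dominates the counts; a real system would back off across several n-gram orders and gap positions, as the abstract's categorization suggests.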
Frequency of Efficient Circulating Follicular Helper T Cells Correlates with Dyslipidemia and WBC Count in Atherosclerosis
Background: The significance of cTfh cells and their subsets in atherosclerosis is not well understood. We measured the frequency of cTfh subsets in patients with different degrees of stenosis using flow cytometry. Methods: Participants included high (≥50%; n = 12) and low (<50%; n = 12) stenosis groups, as well as healthy controls (n = 6). Results: The frequency of CCR7lo PD-1hi efficient-cTfh w...